Abstract

In this paper we develop algorithms for approximating
matrix multiplication with respect to the spectral norm. Let
$A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{n \times p}$ be two matrices and $\varepsilon > 0$. We
approximate the product $A^\top B$ using two sketches $\widetilde{A} \in \mathbb{R}^{t \times m}$
and $\widetilde{B} \in \mathbb{R}^{t \times p}$, where $t \ll n$, such that

$$\|\widetilde{A}^\top \widetilde{B} - A^\top B\|_2 \le \varepsilon \|A\|_2 \|B\|_2$$

with high probability. We analyze two different sampling
procedures for constructing $\widetilde{A}$ and $\widetilde{B}$; one of them samples
rows of $A$ and $B$ i.i.d. from a non-uniform distribution, and the
other takes random linear combinations of their rows.
We prove bounds on $t$ that depend only on the intrinsic
dimensionality of $A$ and $B$, that is, their rank and their stable
rank.

For achieving bounds that depend on rank when taking
random linear combinations we employ standard tools from
high-dimensional geometry such as concentration of measure
arguments combined with elaborate $\varepsilon$-net constructions. For
bounds that depend on the smaller parameter of stable
rank this technology by itself seems weak. However, we show
that in combination with a simple truncation argument it
can provide such bounds. To handle similar
bounds for row sampling, we develop a novel matrix-valued
Chernoff bound inequality which we call the low rank matrix-valued
Chernoff bound. Thanks to this inequality, we are
able to give bounds that depend only on the stable rank of
the input matrices.

We highlight the usefulness of our approximate matrix
multiplication bounds by supplying two applications. First
we give an approximation algorithm for the $\ell_2$-regression
problem that returns an approximate solution by randomly
projecting the initial problem to dimensions linear in the
rank of the constraint matrix. Second we give improved
approximation algorithms for the low rank matrix approximation
problem with respect to the spectral norm.

1 Introduction 

In many scientific applications, data is often naturally
expressed as a matrix, and computational problems on
such data are reduced to standard matrix operations
including matrix multiplication, $\ell_2$-regression, and low
rank matrix approximation.

In this paper we analyze several approximation
algorithms with respect to these operations. All of
our algorithms share a common underlying framework
which can be described as follows: Let $A$ be an input
matrix on which we want to perform a matrix computation
in order to infer some useful information about the data
that it represents. The main idea is to work with a
sample of $A$ (a.k.a. sketch), call it $\widetilde{A}$, and hope that the
information obtained from $\widetilde{A}$ will be in some sense close
to the information that would have been extracted from
$A$.

In this generality, the above approach (sometimes
called "Monte-Carlo method for linear algebraic problems")
is ubiquitous, and is responsible for much of
the development in fast matrix computations [FKV04,
DKM06a, Sar06, DMM06, AM07, CW09, DR10].

As we sample $A$ to create a sketch $\widetilde{A}$, our goal
is twofold: (i) guarantee that $\widetilde{A}$ resembles $A$ in the
relevant measure, and (ii) achieve such an $\widetilde{A}$ using as few
samples as possible. The standard tool that provides a
handle on these requirements when the objects are real
numbers is the Chernoff bound inequality. However,
since we deal with matrices, we would like to have an
analogous probabilistic tool suitable for matrices. Quite
recently a non-trivial generalization of Chernoff bound
type inequalities for matrix-valued random variables
was introduced by Ahlswede and Winter [AW02]. Such
inequalities are suitable for the type of problems that we
will consider here. However, these inequalities and
their variants that have been proposed in the literature
[GLF+09, Rec09, Gro09, Tro10] all suffer from the fact
that their bounds depend on the dimensionality of the
samples. We argue that in a wide range of applications,
this dependency can be quite detrimental.

Specifically, whenever the following two conditions
hold we typically provide stronger bounds compared
with the existing tools: (a) the input matrix has low
intrinsic dimensionality such as rank or stable rank,
(b) the matrix samples themselves have low rank.
Condition (a) is very common in applications,
from the simple fact that viewing data using matrices
typically leads to redundant representations. Typical
sampling methods tend to rely on extremely simple
sampling matrices, i.e., samples that are supported on
only one entry [AHK06, AM07, DZ10] or samples that
are obtained by the outer-product of the sampled rows
or columns [DKM06a, RV07]; therefore condition (b)
is often natural to assume. By incorporating the rank
assumption of the matrix samples on the above matrix-valued
inequalities we are able to develop a "dimension-free"
matrix-valued Chernoff bound. See Theorem 1.1
for more details.

Fundamental to the applications we derive are two
probabilistic tools that provide concentration bounds
for certain random matrices. These tools are inherently
different, and each pertains to a different sampling
procedure. In the first, we multiply the input matrix
by a random sign matrix, whereas in the second we
sample rows according to a distribution that depends
on the input matrix. In particular, the first method is
oblivious (the probability space does not depend on the
input matrix) while the second is not.

The first tool is the so-called subspace Johnson-Lindenstrauss
lemma. Such a result was obtained
in [Sar06] (see also [Cla08, Theorem 1.3]) although
it appears implicitly in results extending the original
Johnson-Lindenstrauss lemma (see [Mag07]). The
techniques for proving such a result with a possibly worse
bound are not new and can be traced back even to
Milman's proof of Dvoretsky's theorem [Mil71].



Lemma 1.1. (Subspace JL lemma [Sar06]) Let $W \subseteq \mathbb{R}^d$
be a linear subspace of dimension $k$ and $\varepsilon \in (0, 1/3)$.
Let $R$ be a $t \times d$ random sign matrix rescaled by $1/\sqrt{t}$,
namely $R_{ij} = \pm 1/\sqrt{t}$ with equal probability. Then

$$P\left( (1-\varepsilon)\|w\|_2^2 \le \|Rw\|_2^2 \le (1+\varepsilon)\|w\|_2^2, \ \forall w \in W \right) \ge 1 - c_2^k \exp(-c_1 \varepsilon^2 t), \qquad (1.1)$$

where $c_1 > 0$, $c_2 > 1$ are constants.
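The statement of Lemma 1.1 can be checked numerically on a small instance. The following is a minimal NumPy sketch, not part of the original analysis: the helper names `random_sign_matrix` and `subspace_distortion` and the sizes `d`, `k`, `t` are illustrative assumptions. It measures the smallest $\varepsilon$ for which the two-sided bound of (1.1) holds on one randomly chosen subspace.

```python
import numpy as np

def random_sign_matrix(t, d, rng):
    """Return a t x d matrix with i.i.d. entries +1/sqrt(t) or -1/sqrt(t)."""
    return rng.choice([-1.0, 1.0], size=(t, d)) / np.sqrt(t)

def subspace_distortion(R, W):
    """Smallest eps with (1-eps)||w||^2 <= ||Rw||^2 <= (1+eps)||w||^2
    for every w in colspan(W), where W has orthonormal columns.

    For w = W c we have ||w|| = ||c||, so the extreme values of
    ||Rw||^2 / ||w||^2 are the extreme eigenvalues of (RW)^T (RW).
    """
    RW = R @ W
    eigs = np.linalg.eigvalsh(RW.T @ RW)  # ascending order
    return max(1.0 - eigs[0], eigs[-1] - 1.0)

rng = np.random.default_rng(0)
d, k, t = 2000, 5, 1000  # illustrative sizes (assumption)
W, _ = np.linalg.qr(rng.standard_normal((d, k)))  # basis of a random k-dim subspace
R = random_sign_matrix(t, d, rng)
print("distortion on the subspace:", subspace_distortion(R, W))
```

Because the whole subspace is handled through a $k \times k$ eigenvalue problem, the check covers every $w \in W$ at once rather than a finite sample of vectors.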

The importance of such a tool is that it allows us to
get bounds on the necessary dimensions of the random
sign matrix in terms of the rank of the input matrices,
see Theorem 3.2 (i.a).

While the assumption that the input matrices have 
low rank is a fairly reasonable assumption, one should 
be a little cautious as the property of having low rank 
is not robust. Indeed, if random noise is added to a 
matrix, even if low rank, the matrix obtained will have 
full rank almost surely. On the other hand, it can be 
shown that the added noise cannot distort the Frobenius 
and operator norm significantly; which makes the notion 
of stable rank robust and so the assumption of low stable 
rank on the input is more applicable than the low rank 
assumption. 
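The contrast between rank and stable rank under noise can be seen on a small example. The sketch below is only an illustration under assumed sizes and an assumed noise level of $10^{-3}$; `stable_rank` is a hypothetical helper implementing $\|A\|_F^2 / \|A\|_2^2$.

```python
import numpy as np

def stable_rank(A):
    """sr(A) = ||A||_F^2 / ||A||_2^2."""
    return np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2

rng = np.random.default_rng(0)
n, m, r = 300, 200, 5  # illustrative sizes (assumption)
low_rank = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank r
noisy = low_rank + 1e-3 * rng.standard_normal((n, m))                 # small perturbation

print("rank:       ", np.linalg.matrix_rank(low_rank), np.linalg.matrix_rank(noisy))
print("stable rank:", stable_rank(low_rank), stable_rank(noisy))
```

The numerical rank jumps to full rank after the perturbation, while the stable rank barely moves, since both norms in its definition change only slightly.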



Given the above discussion, we resort to a different
methodology, called matrix-valued Chernoff bounds.
These are non-trivial generalizations of the standard
Chernoff bounds over the reals and were first introduced
in [AW02]. Part of the contribution of the current
work is to show that such inequalities, similarly
to their real-valued ancestors, provide powerful tools
to analyze randomized algorithms. There is a rapidly
growing line of research exploiting the power of such
inequalities including matrix approximation by sparsification
[AM07, DZ10]; analysis of algorithms for matrix
completion and decomposition of low rank matrices
[CR07, Gro09, Rec09]; and semi-definite relaxation
and rounding of quadratic maximization problems
[Nem07, So09a, So09b].

The quality of these bounds can be measured by the
number of samples needed in order to obtain small error
probability. The original result of [AW02, Theorem 19]
shows that$^1$ if $M$ is distributed according to some
distribution over $n \times n$ matrices with zero mean$^2$, and
if $M_1, \ldots, M_t$ are independent copies of $M$ then for any
$\varepsilon > 0$,

$$P\left( \left\| \frac{1}{t} \sum_{i=1}^{t} M_i \right\|_2 > \varepsilon \right) \le n \exp\left( -C \frac{\varepsilon^2 t}{\gamma^2} \right), \qquad (1.2)$$

where $\|M\|_2 \le \gamma$ holds almost surely and $C > 0$ is an
absolute constant.

Notice that the number of samples in Ineq. (1.2)
depends logarithmically on $n$. In general, unfortunately,
such a dependency is inevitable: take for example a diagonal
random sign matrix of dimension $n$. The operator
norm of the sum of $t$ independent samples is precisely
the maximum deviation among $n$ independent random
walks of length $t$. In order to achieve a fixed bound on
the maximum deviation with constant probability, it is
easy to see that $t$ should grow logarithmically with $n$ in
this scenario.
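The diagonal random sign example can be simulated directly. The sketch below is illustrative only: the function name `avg_diag_sign_norm` and the values of `n` and `t` are assumptions. It uses the fact that the spectral norm of a diagonal matrix is its largest absolute diagonal entry, so only the $n$ diagonal random walks need to be tracked.

```python
import numpy as np

def avg_diag_sign_norm(n, t, rng):
    """Spectral norm of (1/t) * (sum of t i.i.d. diagonal random sign matrices).

    Each diagonal position is an independent random walk of length t,
    and the norm is the largest absolute average over the n positions.
    """
    signs = rng.choice([-1.0, 1.0], size=(t, n))
    return np.abs(signs.mean(axis=0)).max()

rng = np.random.default_rng(0)
t = 200  # number of samples, held fixed (assumption)
for n in (1, 100, 10_000):
    print(n, avg_diag_sign_norm(n, t, rng))
```

With $t$ held fixed, the printed norm tends to grow as $n$ increases, which is the dimension dependence that the argument above describes.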

In their seminal paper, Rudelson and Vershynin
provide a matrix-valued Chernoff bound that avoids
the dependency on the dimensions by assuming that
the matrix samples are the outer product $x \otimes x$ of a
randomly distributed vector $x$ [RV07]. It turns out
that this assumption is too strong in most applications,
such as the ones we study in this work, and so we wish
to relax it without increasing the bound significantly.
In the following theorem we replace this assumption
with that of having low rank. We should note that we
are not aware of a simple way to extend Theorem 3.1
of [RV07] to the low rank case, even constant rank.
The main technical obstacle is the use of the powerful
Rudelson selection lemma, see [Rud99] or Lemma 3.5
of [RV07], which applies only to Rademacher sums
of outer products of vectors. We bypass this obstacle
by proving a more general lemma, see Lemma 5.2.
The proof of Lemma 5.2 relies on the non-commutative
Khintchine moment inequality [LP86, Buc01] which is
also the backbone in the proof of Rudelson's selection
lemma. With Lemma 5.2 at our disposal, the proof
techniques of [RV07] can be adapted to support our
more general condition.

$^1$ For ease of presentation we actually provide the restatement
presented in [WX08, Theorem 2.6], which is more suitable for this
discussion.

$^2$ Zero mean means that the (matrix-valued) expectation is the
zero $n \times n$ matrix.

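Theorem 1.1. Let $0 < \varepsilon < 1$ and $M$ be a random
symmetric real matrix with $\|E M\|_2 \le 1$ and $\|M\|_2 \le \gamma$
almost surely. Assume that each element on the support
of $M$ has at most rank $r$. Set $t = \Omega(\gamma \log(\gamma/\varepsilon^2)/\varepsilon^2)$. If
$r \le t$ holds almost surely, then

$$P\left( \left\| \frac{1}{t} \sum_{i=1}^{t} M_i - E M \right\|_2 > \varepsilon \right) \le \frac{1}{\mathrm{poly}(t)},$$

where $M_1, M_2, \ldots, M_t$ are i.i.d. copies of $M$.

Proof. See Appendix, page 12.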

Remark 1. (Optimality) The above theorem cannot
be improved in terms of the number of samples required
without changing its form, since in the special case
where the rank of the samples is one it is exactly
the statement of Theorem 3.1 of [RV07], see [RV07,
Remark 3.4].

We highlight the usefulness of the above main tools
by first proving a "dimension-free" approximation algorithm
for matrix multiplication with respect to the
spectral norm (Section 3.1). Utilizing this matrix multiplication
bound we get an approximation algorithm for
the $\ell_2$-regression problem which returns an approximate
solution by randomly projecting the initial problem to
dimensions linear in the rank of the constraint matrix
(Section 3.2). Finally, in Section 3.3 we give improved
approximation algorithms for the low rank matrix approximation
problem with respect to the spectral norm,
and moreover answer in the affirmative a question left
open by the authors of [NDT09].

2 Preliminaries and Definitions 

The next discussion reviews several definitions and facts
from linear algebra; for more details, see [SS90, GV96,
Bha96]. We abbreviate the terms independently and
identically distributed and almost surely with i.i.d. and
a.s., respectively. We let $\mathbb{S}^{n-1} := \{x \in \mathbb{R}^n \mid \|x\|_2 = 1\}$
be the $(n-1)$-dimensional sphere. A random
Gaussian matrix is a matrix whose entries are i.i.d.
standard Gaussians, and a random sign matrix is a
matrix whose entries are independent Bernoulli random
variables, that is they take values from $\{\pm 1\}$ with
equal probability. For a matrix $A \in \mathbb{R}^{n \times m}$, $A_{(i)}$, $A^{(j)}$
denote the $i$'th row, $j$'th column, respectively. For a
matrix with rank $r$, the Singular Value Decomposition
(SVD) of $A$ is the decomposition of $A$ as $U \Sigma V^\top$ where
$U \in \mathbb{R}^{n \times r}$, $V \in \mathbb{R}^{m \times r}$, the columns of $U$ and $V$ are
orthonormal, and $\Sigma = \mathrm{diag}(\sigma_1(A), \ldots, \sigma_r(A))$ is an $r \times r$
diagonal matrix. We further assume $\sigma_1 \ge \ldots \ge \sigma_r > 0$
and call these real numbers the singular values of $A$. By
$A_k = U_k \Sigma_k V_k^\top$ we denote the best rank $k$ approximation
to $A$, where $U_k$ and $V_k$ are the matrices formed by the
first $k$ columns of $U$ and $V$, respectively. We denote by
$\|A\|_2 = \max\{\|Ax\|_2 \mid \|x\|_2 = 1\}$ the spectral norm of
$A$, and by $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ the Frobenius norm of $A$.
We denote by $A^\dagger$ the Moore-Penrose pseudo-inverse of
$A$, i.e., $A^\dagger = V \Sigma^{-1} U^\top$. Notice that $\sigma_1(A) = \|A\|_2$. Also
we define by $\mathrm{sr}(A) := \|A\|_F^2 / \|A\|_2^2$ the stable rank of
$A$. Notice that the inequality $\mathrm{sr}(A) \le \mathrm{rank}(A)$ always
holds. The orthogonal projector of a matrix $A$ onto the
row-space of a matrix $C$ is denoted by $P_C(A) = A C^\dagger C$.
By $P_{C,k}(A)$ we define the best rank-$k$ approximation of
the matrix $P_C(A)$.
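The operators $A_k$, $P_C(A)$ and $P_{C,k}(A)$ defined above translate directly into a few lines of NumPy. The sketch below is a plain transcription of the definitions; the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def best_rank_k(A, k):
    """A_k = U_k Sigma_k V_k^T, obtained from a truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def project_rows(A, C):
    """P_C(A) = A C^+ C: projects every row of A onto rowspan(C)."""
    return A @ np.linalg.pinv(C) @ C

def project_rows_rank_k(A, C, k):
    """P_{C,k}(A): best rank-k approximation of P_C(A)."""
    return best_rank_k(project_rows(A, C), k)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
C = rng.standard_normal((4, 10))
P = project_rows(A, C)
print("rank of P_C(A):", np.linalg.matrix_rank(P))
print("P_C is idempotent:", np.allclose(project_rows(P, C), P))
```

Since $C^\dagger C$ is the orthogonal projector onto the row-space of $C$, applying `project_rows` twice gives the same result as applying it once.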

3 Applications 

All the proofs of this section have been deferred to
Section 4.

3.1 Matrix Multiplication The seminal research
of [FKV04] focuses on using non-uniform row sampling
to speed up the running time of several matrix computations.
The subsequent developments of [DKM06a,
DKM06b, DKM06c] also study the performance of
Monte-Carlo algorithms on primitive matrix algorithms
including the matrix multiplication problem with respect
to the Frobenius norm. Sarlos [Sar06] extended
(and improved) this line of research using random projections.
Most of the bounds for approximating matrix
multiplication in the literature are with respect
to the Frobenius norm [DKM06a, Sar06, CW09]. In
some cases, the techniques that are utilized for bounding
the Frobenius norm also imply weak bounds for the
spectral norm, see [DKM06a, Theorem 4] or [Sar06,
Corollary 11] which is similar to part (i.a) of Theorem 3.2.

In this section we develop approximation algorithms
for matrix multiplication with respect to the spectral
norm. The algorithms that will be presented in this
section are based on the tools mentioned in Section 1.



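Variants of Matrix-valued Inequalities

Assumption on the sample $M$ | # of samples ($t$) | Failure Prob. | References | Comments
-----------------------------|--------------------|---------------|------------|---------
$\|M\|_2 \le \gamma$ a.s. | $\Omega(\gamma^2 \log(n)/\varepsilon^2)$ | $1/\mathrm{poly}(n)$ | [WX08] | Hoeffding
$\|M\|_2 \le \gamma$ a.s., $\|E M^2\|_2 \le \rho^2$ | $\Omega((\rho^2 + \gamma\varepsilon/3) \log(n)/\varepsilon^2)$ | $1/\mathrm{poly}(n)$ | [Rec09] | Bernstein
$\|M\|_2 \le \gamma$ a.s., $M = x \otimes x$, $\|E M\|_2 \le 1$ | $\Omega(\gamma \log(\gamma/\varepsilon^2)/\varepsilon^2)$ | $\exp(-\Omega(\varepsilon^2 t/(\gamma \log t)))$ | [RV07] | Rank one
$\|M\|_2 \le \gamma$, $\mathrm{rank}(M) \le t$ a.s., $\|E M\|_2 \le 1$ | $\Omega(\gamma \log(\gamma/\varepsilon^2)/\varepsilon^2)$ | $1/\mathrm{poly}(t)$ | Theorem 1.1 | Low rank

Table 1: Summary of matrix-valued Chernoff bounds. $M$ is a probability distribution over symmetric $n \times n$
matrices. $M_1, \ldots, M_t$ are i.i.d. copies of $M$.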



Before stating our main dimension-free matrix multiplication
theorem (Theorem 3.2), we discuss the best
possible bound that can be achieved using the currently
known matrix-valued inequalities (to the best of our
knowledge). Consider a direct application of Ineq. (1.2),
where an analysis similar to that in the proof of Theorem 3.2 (ii)
would allow us to achieve a bound of
$\Omega(\tilde{r}^2 \log(m+p)/\varepsilon^2)$ on the number of samples, where
$\tilde{r}$ bounds the stable rank of the input matrices (details
omitted). However, as the next theorem indicates
(proof omitted) we can get linear dependency on the
stable rank of the input matrices by exploiting the "variance
information" of the samples; more precisely, this
can be achieved by applying the matrix-valued Bernstein
inequality, see e.g. [GLF+09], [Rec09, Theorem 3.2]
or [Tro10, Theorem 2.10].

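Theorem 3.1. Let $0 < \varepsilon < 1/2$ and let $A \in \mathbb{R}^{n \times m}$,
$B \in \mathbb{R}^{n \times p}$ both having stable rank at most $\tilde{r}$. The
following hold:

(i) Let $R$ be a $t \times n$ random sign matrix rescaled by
$1/\sqrt{t}$. Denote by $\widetilde{A} = RA$ and $\widetilde{B} = RB$. If
$t = \Omega(\tilde{r} \log(m+p)/\varepsilon^2)$ then

$$P\left( \|\widetilde{A}^\top \widetilde{B} - A^\top B\|_2 \le \varepsilon \|A\|_2 \|B\|_2 \right) \ge 1 - \frac{1}{\mathrm{poly}(\tilde{r})}.$$

(ii) Let $p_i = \|A_{(i)}\|_2 \|B_{(i)}\|_2 / S$, where
$S = \sum_{i=1}^{n} \|A_{(i)}\|_2 \|B_{(i)}\|_2$, be a probability distribution
over $[n]$. If we form a $t \times m$ matrix $\widetilde{A}$ and a $t \times p$
matrix $\widetilde{B}$ by taking $t = \Omega(\tilde{r} \log(m+p)/\varepsilon^2)$ i.i.d.
(row indices) samples from $p_i$, then

$$P\left( \|\widetilde{A}^\top \widetilde{B} - A^\top B\|_2 \le \varepsilon \|A\|_2 \|B\|_2 \right) \ge 1 - \frac{1}{\mathrm{poly}(\tilde{r})}.$$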

Notice that the above bounds depend linearly on the
stable rank of the matrices and logarithmically on their
dimensions. As we will see in the next theorem we
can remove the dependency on the dimensions, and
replace it with the stable rank. Recall that in most
cases matrices do have low stable rank, which is much
smaller than their dimensionality.
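The row sampling procedure of part (ii) can be sketched as follows. This is a minimal illustration rather than a tuned implementation: the function names, the test matrices and the sample size `t` are assumptions, and each sampled row is rescaled by $1/\sqrt{t\,p_i}$ so that the sketched product is an unbiased estimate of the true product.

```python
import numpy as np

def row_sample_sketch(A, B, t, rng):
    """Sample t row indices with p_i proportional to ||A_(i)|| * ||B_(i)||
    and rescale so that E[At.T @ Bt] = A.T @ B."""
    w = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    p = w / w.sum()
    idx = rng.choice(A.shape[0], size=t, p=p)
    scale = 1.0 / np.sqrt(t * p[idx])  # per-row rescaling
    return A[idx] * scale[:, None], B[idx] * scale[:, None]

def relative_spectral_error(A, B, At, Bt):
    """||At^T Bt - A^T B||_2 / (||A||_2 ||B||_2)."""
    err = np.linalg.norm(At.T @ Bt - A.T @ B, 2)
    return err / (np.linalg.norm(A, 2) * np.linalg.norm(B, 2))

rng = np.random.default_rng(0)
n, m, p_dim, r = 5000, 40, 30, 3  # illustrative sizes (assumption)
C = rng.standard_normal((n, r))
A = C @ rng.standard_normal((r, m))
B = C @ rng.standard_normal((r, p_dim))
At, Bt = row_sample_sketch(A, B, t=1000, rng=rng)
print("relative spectral error:", relative_spectral_error(A, B, At, Bt))
```

Rows with zero norm in either matrix receive probability zero and are never sampled, so the division by $p_i$ is always well defined for the sampled indices.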



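Theorem 3.2. Let $0 < \varepsilon < 1/2$ and let $A \in \mathbb{R}^{n \times m}$,
$B \in \mathbb{R}^{n \times p}$ both having rank and stable rank at most $r$
and $\tilde{r}$, respectively. The following hold:

(i) Let $R$ be a $t \times n$ random sign matrix rescaled by
$1/\sqrt{t}$. Denote by $\widetilde{A} = RA$ and $\widetilde{B} = RB$.

(a) If $t = \Omega(r/\varepsilon^2)$ then

$$P\left( \forall x \in \mathbb{R}^m, y \in \mathbb{R}^p: \ |x^\top (\widetilde{A}^\top \widetilde{B} - A^\top B) y| \le \varepsilon \|Ax\|_2 \|By\|_2 \right) \ge 1 - e^{-\Omega(r)}.$$

(b) If $t = \Omega(\tilde{r}/\varepsilon^4)$ then

$$P\left( \|\widetilde{A}^\top \widetilde{B} - A^\top B\|_2 \le \varepsilon \|A\|_2 \|B\|_2 \right) \ge 1 - e^{-\Omega(\tilde{r}/\varepsilon^2)}.$$

(ii) Let $p_i = \|A_{(i)}\|_2 \|B_{(i)}\|_2 / S$, where
$S = \sum_{i=1}^{n} \|A_{(i)}\|_2 \|B_{(i)}\|_2$, be a probability distribution
over $[n]$. If we form a $t \times m$ matrix $\widetilde{A}$ and a
$t \times p$ matrix $\widetilde{B}$ by taking $t = \Omega(\tilde{r} \log(\tilde{r}/\varepsilon^2)/\varepsilon^2)$
i.i.d. (row indices) samples from $p_i$, then

$$P\left( \|\widetilde{A}^\top \widetilde{B} - A^\top B\|_2 \le \varepsilon \|A\|_2 \|B\|_2 \right) \ge 1 - \frac{1}{\mathrm{poly}(\tilde{r})}.$$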



Remark 2. In part (ii), we can actually achieve the
stronger bound of $t = \Omega(\sqrt{\mathrm{sr}(A)\,\mathrm{sr}(B)} \log(\mathrm{sr}(A)\,\mathrm{sr}(B)/\varepsilon^4)/\varepsilon^2)$
(see proof). However, for ease of presentation
and comparison we give the above displayed
bound.
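The event of part (i.a) quantifies over all $x$ and $y$, yet it can be evaluated exactly on a given instance. Writing $Ax = U_A a$ and $By = U_B b$ for orthonormal bases $U_A$, $U_B$ of the column spaces, the smallest admissible $\varepsilon$ equals $\|U_A^\top (R^\top R - I) U_B\|_2$. The sketch below computes this quantity next to the weaker spectral-norm error; the function names and problem sizes are assumptions made for illustration.

```python
import numpy as np

def orth_basis(A, tol=1e-10):
    """Orthonormal basis of colspan(A), read off the SVD."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, s > tol * s[0]]

def bilinear_distortion(A, B, R):
    """Smallest eps with |x^T((RA)^T(RB) - A^T B)y| <= eps ||Ax|| ||By|| for all x, y."""
    UA, UB = orth_basis(A), orth_basis(B)
    return np.linalg.norm((R @ UA).T @ (R @ UB) - UA.T @ UB, 2)

def relative_spectral_error(A, B, R):
    """||(RA)^T(RB) - A^T B||_2 / (||A||_2 ||B||_2)."""
    At, Bt = R @ A, R @ B
    err = np.linalg.norm(At.T @ Bt - A.T @ B, 2)
    return err / (np.linalg.norm(A, 2) * np.linalg.norm(B, 2))

rng = np.random.default_rng(0)
n, m, p_dim, r, t = 2000, 40, 30, 4, 500  # illustrative sizes (assumption)
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
B = rng.standard_normal((n, r)) @ rng.standard_normal((r, p_dim))
R = rng.choice([-1.0, 1.0], size=(t, n)) / np.sqrt(t)
print("bilinear distortion (i.a):", bilinear_distortion(A, B, R))
print("relative spectral error:  ", relative_spectral_error(A, B, R))
```

The second number never exceeds the first, since taking the supremum over unit vectors $x$, $y$ and bounding $\|Ax\|_2 \|By\|_2 \le \|A\|_2 \|B\|_2$ turns the bilinear bound into the spectral one.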

Part (i.b) follows from (i.a) via a simple truncation argument.
This was pointed out to us by Mark Rudelson
(personal communication). To understand the significance
and the differences between the different components
of this theorem, we first note that the probabilistic
event of part (i.a) is superior to the probabilistic
event of (i.b) and (ii). Indeed, when $B = A$ the former
implies that $|x^\top (\widetilde{A}^\top \widetilde{A} - A^\top A) x| \le \varepsilon \cdot x^\top A^\top A x$ for every
$x$, which is stronger than $\|\widetilde{A}^\top \widetilde{A} - A^\top A\|_2 \le \varepsilon \|A\|_2^2$.
We will heavily exploit this fact in Section 4.3 to prove
Theorem 3.4 (i.a) and (ii). Also notice that part (i.b) is
essentially computationally inferior to (ii) as it gives the
same bound while it is more expensive computationally
to multiply the matrices by random sign matrices than
just sampling their rows. However, the advantage of
part (i) is that the sampling process is oblivious, i.e.,
does not depend on the input matrices.

3.2 $\ell_2$-regression In this section we present an approximation
algorithm for the least-squares regression
problem; given an $n \times m$, $n > m$, real matrix $A$ of rank $r$
and a real vector $b \in \mathbb{R}^n$ we want to compute $x_{opt} = A^\dagger b$
that minimizes $\|Ax - b\|_2$ over all $x \in \mathbb{R}^m$. In their
seminal paper [DMM06], Drineas et al. show that if we
non-uniformly sample $t = \Omega(m^2/\varepsilon^2)$ rows from $A$ and $b$,
then with high probability the optimum solution of the
$t \times m$ sampled problem will be within $(1+\varepsilon)$ of the
original problem. The main drawback of their approach
is that finding or even approximating the sampling probabilities
is computationally intractable. Sarlos [Sar06]
improved the above to $t = \Omega(m \log m/\varepsilon^2)$ and gave the
first $o(nm^2)$ relative error approximation algorithm for
this problem.

In the next theorem we eliminate the extra $\log m$
factor from Sarlos' bounds, and more importantly, replace
the dimension (number of variables) $m$ with
the rank $r$ of the constraints matrix $A$. We should
point out that independently, the same bound as our
Theorem 3.3 was recently obtained by Clarkson and
Woodruff [CW09] (see also [DMMS09]). The proof of
Clarkson and Woodruff uses heavy machinery and a
completely different approach. In a nutshell they manage
to improve the matrix multiplication bound with
respect to the Frobenius norm. They achieve this by
bounding higher moments of the Frobenius norm of the
approximation viewed as a random variable instead of
bounding the local differences for each coordinate of the
product. To do so, they rely on intricate moment calculations
spanning over four pages, see [CW09] for more.
On the other hand, the proof of the present $\ell_2$-regression
bound uses only basic matrix analysis, elementary deviation
bounds and $\varepsilon$-net arguments. More precisely, we
argue that Theorem 3.2 (i.a) immediately implies that
randomly projecting to dimensions linear in the intrinsic
dimensionality of the constraints, i.e., the rank
of $A$, is sufficient, as the following theorem indicates.



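Theorem 3.3. Let $A \in \mathbb{R}^{n \times m}$ be a real matrix of rank
$r$ and $b \in \mathbb{R}^n$. Let $\min_{x \in \mathbb{R}^m} \|b - Ax\|_2$ be the $\ell_2$-regression
problem, where the minimum is achieved with
$x_{opt} = A^\dagger b$. Let $0 < \varepsilon < 1/3$, $R$ be a $t \times n$ random sign
matrix rescaled by $1/\sqrt{t}$ and $\tilde{x}_{opt} = (RA)^\dagger R b$.

• If $t = \Omega(r/\varepsilon)$, then with high probability,

$$\|b - A\tilde{x}_{opt}\|_2 \le (1+\varepsilon) \|b - Ax_{opt}\|_2. \qquad (3.3)$$

• If $t = \Omega(r/\varepsilon^2)$, then with high probability,

$$\|x_{opt} - \tilde{x}_{opt}\|_2 \le \frac{\varepsilon}{\sigma_{\min}(A)} \|b - Ax_{opt}\|_2. \qquad (3.4)$$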

Remark 3. The above result can be easily generalized
to the case where $b$ is an $n \times p$ matrix $B$ of rank
at most $r$ (see proof). This is known as the generalized
$\ell_2$-regression problem in the literature, i.e.,
$\arg\min_{X \in \mathbb{R}^{m \times p}} \|AX - B\|_2$ where $B$ is an $n \times p$ rank
$r$ matrix.
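The estimator $\tilde{x}_{opt} = (RA)^\dagger R b$ of Theorem 3.3 can be sketched in a few lines. The following is an illustration under assumed problem sizes, with a full column rank $A$ so that $r = m$; `sketched_least_squares` is a hypothetical name, and `np.linalg.lstsq` is used because it returns the minimum-norm least-squares solution, which coincides with applying the pseudo-inverse.

```python
import numpy as np

def sketched_least_squares(A, b, t, rng):
    """Solve min_x ||R A x - R b||_2 for a t x n rescaled random sign matrix R."""
    n = A.shape[0]
    R = rng.choice([-1.0, 1.0], size=(t, n)) / np.sqrt(t)
    x_tilde, *_ = np.linalg.lstsq(R @ A, R @ b, rcond=None)
    return x_tilde

rng = np.random.default_rng(0)
n, m, t = 4000, 10, 400  # illustrative sizes (assumption)
A = rng.standard_normal((n, m))
b = A @ rng.standard_normal(m) + rng.standard_normal(n)  # noisy right-hand side

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
x_tilde = sketched_least_squares(A, b, t, rng)
ratio = np.linalg.norm(b - A @ x_tilde) / np.linalg.norm(b - A @ x_opt)
print("residual ratio:", ratio)
```

The printed ratio is the quantity bounded by $(1+\varepsilon)$ in (3.3); it is never below one, because $x_{opt}$ minimizes the residual of the original problem.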

3.3 Spectral Low Rank Matrix Approximation 

A large body of work on low rank matrix approximations
[DK03, FKV04, DRVW06, Sar06, RV07, AM07,
RST09, CW09, NDT09, HMT09] has been recently developed
with the main objective of developing more efficient
algorithms for this task. Most of these results study approximation
algorithms with respect to the Frobenius
norm, except for [RV07, NDT09] that handle the spectral
norm.

In this section we present two $(1+\varepsilon)$-relative-error
approximation algorithms for this problem with
respect to the spectral norm, i.e., given an $n \times m$,
$n > m$, real matrix $A$ of rank $r$, we wish to compute
$A_k = U_k \Sigma_k V_k^\top$, which minimizes $\|A - X_k\|_2$ over the
set of $n \times m$ matrices of rank $k$, $X_k$. The first additive
bound for this problem was obtained in [RV07]. To
the best of our knowledge the best relative bound
was recently achieved in [NDT09, Theorem 1]. The
latter result is not directly comparable with ours, since
it uses a more restricted projection methodology and
so their bound is weaker compared to our results.
The first algorithm randomly projects the rows of the
input matrix onto $t$ dimensions. Here, we set $t$ to be
either $\Omega(r/\varepsilon^2)$ in which case we get a $(1+\varepsilon)$ error
guarantee, or to be $\Omega(k/\varepsilon^2)$ in which case we show
a $(2 + \varepsilon\sqrt{(r-k)/k})$ error approximation. In both
cases the algorithm succeeds with high probability. The
second approximation algorithm samples non-uniformly
$\Omega(r \log(r/\varepsilon^2)/\varepsilon^2)$ rows from $A$ in order to satisfy the
$(1+\varepsilon)$ guarantee with high probability.

The following lemma (Lemma 3.1) is essential for
proving both relative error bounds of Theorem 3.4. It
gives a sufficient condition that any matrix $\widetilde{A}$ should
satisfy in order to get a $(1+\varepsilon)$ spectral low rank matrix
approximation of $A$ for every $k$, $1 \le k \le \mathrm{rank}(A)$.

Lemma 3.1. Let $A$ be an $n \times m$ matrix and $\varepsilon > 0$. If
there exists a $t \times m$ matrix $\widetilde{A}$ such that for every $x \in \mathbb{R}^m$,
$(1-\varepsilon) x^\top A^\top A x \le x^\top \widetilde{A}^\top \widetilde{A} x \le (1+\varepsilon) x^\top A^\top A x$, then

$$\|A - P_{\widetilde{A},k}(A)\|_2 \le (1+\varepsilon) \|A - A_k\|_2,$$

for every $k = 1, \ldots, \mathrm{rank}(A)$.



The theorem below shows that it is possible to satisfy
the conditions of Lemma 3.1 by randomly projecting
$A$ onto $\Omega(r/\varepsilon^2)$ dimensions or by non-uniform sampling of
$\Omega(r \log(r/\varepsilon^2)/\varepsilon^2)$ i.i.d. rows of $A$ as described in parts (i.a)
and (ii), respectively.

Theorem 3.4. Let $0 < \varepsilon < 1/3$ and let $A = U \Sigma V^\top$ be
a real $n \times m$ matrix of rank $r$ with $n > m$.

(i) (a) Let $R$ be a $t \times n$ random sign matrix rescaled
by $1/\sqrt{t}$ and set $\widetilde{A} = RA$. If $t = \Omega(r/\varepsilon^2)$, then
with high probability

$$\|A - P_{\widetilde{A},k}(A)\|_2 \le (1+\varepsilon) \|A - A_k\|_2,$$

for every $k = 1, \ldots, r$.

(b) Let $R$ be a $t \times n$ random Gaussian matrix
rescaled by $1/\sqrt{t}$ and set $\widetilde{A} = RA$. If
$t = \Omega(k/\varepsilon^2)$, then with high probability

$$\|A - P_{\widetilde{A},k}(A)\|_2 \le \left( 2 + \varepsilon \sqrt{\frac{r-k}{k}} \right) \|A - A_k\|_2.$$

(ii) Let $p_i = \|U_{(i)}\|_2^2 / r$ be a probability distribution
over $[n]$. Let $\widetilde{A}$ be a $t \times m$ matrix that is formed
(row-by-row) by taking $t$ i.i.d. samples from $p_i$ and
rescaled appropriately. If $t = \Omega(r \log(r/\varepsilon^2)/\varepsilon^2)$,
then with high probability

$$\|A - P_{\widetilde{A},k}(A)\|_2 \le (1+\varepsilon) \|A - A_k\|_2,$$

for every $k = 1, \ldots, r$.
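The random projection procedure of part (i) amounts to forming $\widetilde{A} = RA$, projecting $A$ onto the row-space of $\widetilde{A}$, and truncating to rank $k$. The sketch below is an illustration only: the helper names are hypothetical, and the test matrix (geometrically decaying singular values, with $t$ smaller than the rank) is an assumption chosen so that the projection is not trivially exact. It is not meant to verify the constants of the theorem.

```python
import numpy as np

def best_rank_k(M, k):
    """Best rank-k approximation of M via a truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def sketched_rank_k(A, k, t, rng):
    """P_{At,k}(A) for At = R A, with R a t x n rescaled random sign matrix."""
    R = rng.choice([-1.0, 1.0], size=(t, A.shape[0])) / np.sqrt(t)
    At = R @ A
    projected = A @ np.linalg.pinv(At) @ At  # P_{At}(A) = A At^+ At
    return best_rank_k(projected, k)

rng = np.random.default_rng(0)
n, m, k, t = 500, 40, 3, 20  # illustrative sizes (assumption)
U, _ = np.linalg.qr(rng.standard_normal((n, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
sigma = 0.5 ** np.arange(1, m + 1)  # assumed decaying spectrum
A = (U * sigma) @ V.T

approx = sketched_rank_k(A, k, t, rng)
ratio = np.linalg.norm(A - approx, 2) / np.linalg.norm(A - best_rank_k(A, k), 2)
print("spectral error ratio:", ratio)
```

The printed ratio compares the sketched approximation with the optimal rank-$k$ error $\|A - A_k\|_2$ and is at least one, since no rank-$k$ matrix can improve on $A_k$ in the spectral norm.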

We should highlight that in part (ii) the probability
distribution $p_i$ is in general hard to compute. Indeed,
computing $\|U_{(i)}\|_2^2$ requires computing the SVD of
$A$. In general, these values are known as statistical
leverage scores [DM10]. In the special case where $A$
is an edge-vertex matrix of an undirected weighted
graph then $p_i$, the probability distribution over edges
(rows), corresponds to the effective-resistance of the
$i$-th edge [SS08].
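The distribution $p_i = \|U_{(i)}\|_2^2 / r$ and the associated row sampling can be written down directly from an SVD, which also makes the computational cost mentioned above explicit. The following sketch uses hypothetical function names and an assumed relative tolerance for deciding the numerical rank; the rescaling $1/\sqrt{t\,p_i}$ is one natural reading of "rescaled appropriately", chosen so that the expectation of $\widetilde{A}^\top \widetilde{A}$ equals $A^\top A$.

```python
import numpy as np

def leverage_distribution(A, tol=1e-10):
    """p_i = ||U_(i)||_2^2 / r, with U from the thin SVD and r the numerical rank."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))
    return np.sum(U[:, :r] ** 2, axis=1) / r

def leverage_row_sample(A, t, rng):
    """Sample t rows i.i.d. from p and rescale each by 1/sqrt(t * p_i)."""
    p = leverage_distribution(A)
    idx = rng.choice(A.shape[0], size=t, p=p)
    return A[idx] / np.sqrt(t * p[idx])[:, None]

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 12))  # rank 5 (assumption)
At = leverage_row_sample(A, t=2000, rng=rng)
err = np.linalg.norm(At.T @ At - A.T @ A, 2) / np.linalg.norm(A, 2) ** 2
print("relative error of the Gram matrix:", err)
```

Only indices with positive probability are ever drawn, so rows with zero leverage score never cause a division by zero.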

Theorem 3.4 gives a $(1+\varepsilon)$ approximation algorithm
for the special case of low rank matrices. However,
as discussed in Section 1 such an assumption is
too restrictive for most applications. In the following
theorem, we go a step further and replace the rank
condition with a condition that depends on the stable
rank of the residual matrix $A - A_k$. More formally, for
an integer $k \ge 1$, we say that a matrix $A$ has a $k$-low
stable rank tail iff $k \ge \mathrm{sr}(A - A_k)$.

Notice that the above definition is useful since it
contains the set of matrices whose spectrum follows
a power-law distribution and those with exponentially
decaying spectrum. Therefore the following theorem
combined with the remark below (partially) answers in
the affirmative the question posed by [NDT09]: Is there
a relative error approximation algorithm with respect
to the spectral norm when the spectrum of the input
matrix decays in a power law?

Theorem 3.5. Let $0 < \varepsilon < 1/3$ and let $A$ be a real
$n \times m$ matrix with a $k$-low stable rank tail. Let $R$ be
a $t \times n$ random sign matrix rescaled by $1/\sqrt{t}$ and set
$\widetilde{A} = RA$. If $t = \Omega(k/\varepsilon^4)$, then with high probability

$$\|A - P_{\widetilde{A},k}(A)\|_2 \le (2+\varepsilon) \|A - A_k\|_2.$$



Remark 4. The $(2+\varepsilon)$ bound can be improved to
a relative $(1+\varepsilon)$ error bound if we return as the
approximate solution a slightly higher rank matrix, i.e.,
by returning the matrix $P_{\widetilde{A}}(A)$, which has rank at most
$t = \Omega(k/\varepsilon^4)$ (see [HMT09, Theorem 9.1]).

4 Proofs 

4.1 Proof of Theorem 3.2 (Matrix Multiplication)

Random Projections - Part (i)

Part (a): In this section we show the first, to the
best of our knowledge, non-trivial spectral bound for
matrix multiplication. Although the proof is an immediate
corollary of the subspace Johnson-Lindenstrauss
lemma (Lemma 1.1), this result is powerful enough to
give, for example, tight bounds for the $\ell_2$ regression
problem. We prove the following more general theorem
from which Theorem 3.2 (i.a) follows by plugging in
$t = \Omega(r/\varepsilon^2)$.

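Theorem 4.1. Let $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{n \times p}$. Assume
that the ranks of $A$ and $B$ are at most $r$. Let $R$ be a
$t \times n$ random sign matrix rescaled by $1/\sqrt{t}$. Denote by
$\widetilde{A} = RA$ and $\widetilde{B} = RB$. The following inequality holds

$$P\left( \forall x \in \mathbb{R}^m, y \in \mathbb{R}^p: \ |x^\top (\widetilde{A}^\top \widetilde{B} - A^\top B) y| \le \varepsilon \|Ax\|_2 \|By\|_2 \right) \ge 1 - c_2^r \exp(-c_1 \varepsilon^2 t),$$

where $c_1 > 0$, $c_2 > 1$ are constants.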

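Proof. (of Theorem 4.1) Let $A = U_A \Sigma_A V_A^\top$, $B = U_B \Sigma_B V_B^\top$
be the singular value decompositions of $A$ and
$B$ respectively. Notice that $U_A \in \mathbb{R}^{n \times r_A}$, $U_B \in \mathbb{R}^{n \times r_B}$,
where $r_A$ and $r_B$ are the ranks of $A$ and $B$, respectively.
Let $x_1 \in \mathbb{R}^m$, $x_2 \in \mathbb{R}^p$ be two arbitrary unit vectors.
Let $w_1 = Ax_1$ and $w_2 = Bx_2$. Recall that

$$\|A^\top R^\top R B - A^\top B\|_2 = \sup_{x_1 \in \mathbb{S}^{m-1},\, x_2 \in \mathbb{S}^{p-1}} |x_1^\top (A^\top R^\top R B - A^\top B) x_2|.$$

We will bound the last term for any arbitrary vector.
Denote with $V$ the subspace$^3$ $\mathrm{colspan}(U_A) \cup \mathrm{colspan}(U_B)$
of $\mathbb{R}^n$. Notice that $\dim(V) \le r_A + r_B \le 2r$.
Applying Lemma 1.1 to $V$, we get that with probability
at least $1 - c_2^r \exp(-c_1 \varepsilon^2 t)$,

$$\forall v \in V: \ \left| \|Rv\|_2^2 - \|v\|_2^2 \right| \le \varepsilon \|v\|_2^2. \qquad (4.5)$$

Therefore we get that for any unit vectors $v_1, v_2 \in V$:

$$(Rv_1)^\top Rv_2 = \frac{\|Rv_1 + Rv_2\|_2^2 - \|Rv_1 - Rv_2\|_2^2}{4}$$
$$\le \frac{(1+\varepsilon)\|v_1 + v_2\|_2^2 - (1-\varepsilon)\|v_1 - v_2\|_2^2}{4}$$
$$= \frac{\|v_1 + v_2\|_2^2 - \|v_1 - v_2\|_2^2}{4} + \varepsilon \frac{\|v_1 + v_2\|_2^2 + \|v_1 - v_2\|_2^2}{4}$$
$$= v_1^\top v_2 + \varepsilon \frac{\|v_1\|_2^2 + \|v_2\|_2^2}{2}$$
$$= v_1^\top v_2 + \varepsilon,$$

where the first equality follows from the Parallelogram
law, the first inequality follows from Equation (4.5),
and the last equality holds since $v_1, v_2$ are unit vectors.
By similar considerations we get that $(Rv_1)^\top Rv_2 \ge v_1^\top v_2 - \varepsilon$.
By linearity of $R$, we get that

$$\forall v_1, v_2 \in V: \ |(Rv_1)^\top Rv_2 - v_1^\top v_2| \le \varepsilon \|v_1\|_2 \|v_2\|_2.$$

Notice that $w_1, w_2 \in V$, hence $|w_1^\top R^\top R w_2 - w_1^\top w_2| \le
\varepsilon \|w_1\|_2 \|w_2\|_2 = \varepsilon \|Ax_1\|_2 \|Bx_2\|_2$.

Part (b): We start with a technical lemma that
bounds the spectral norm of any matrix $A$ when it is
multiplied by a random sign matrix rescaled by $1/\sqrt{t}$.

Lemma 4.1. Let $A$ be an $n \times m$ real matrix, and let
$R$ be a $t \times n$ random sign matrix rescaled by $1/\sqrt{t}$. If
$t \ge \mathrm{sr}(A)$, then

$$P\left( \|RA\|_2 \ge 4\|A\|_2 \right) \le 2e^{-t/2}. \qquad (4.6)$$

Proof. Without loss of generality assume that $\|A\|_2 = 1$.
Then $\|A\|_F = \sqrt{\mathrm{sr}(A)}$. Let $G$ be a $t \times n$ Gaussian
matrix. Then by the Gordon-Chevet inequality$^4$

$$E\|GA\|_2 \le \|I_t\|_2 \|A\|_F + \|I_t\|_F \|A\|_2 = \|A\|_F + \sqrt{t} \le 2\sqrt{t}.$$

The Gaussian distribution is symmetric, so $G_{ij}$ and
$\sqrt{t} R_{ij} \cdot |G_{ij}|$, where $G_{ij}$ is a Gaussian random variable,
have the same distribution. By Jensen's inequality
and the fact that $E|G_{ij}| = \sqrt{2/\pi}$, we get that
$\sqrt{2/\pi}\, E\|RA\|_2 \le E\|GA\|_2/\sqrt{t}$. Define the function
$f : \{\pm 1\}^{t \times n} \to \mathbb{R}$ by $f(S) = \left\| \frac{1}{\sqrt{t}} SA \right\|_2$. The calculation
above shows that $\mathrm{median}(f) \le \sqrt{2\pi}$. Since $f$ is
convex and $(1/\sqrt{t})$-Lipschitz as a function of the entries
of $S$, Talagrand's measure concentration inequality for
convex functions yields

$$P\left( \|RA\|_2 \ge \mathrm{median}(f) + \delta \right) \le 2\exp(-\delta^2 t/2).$$

Setting $\delta = 1$ in the above inequality implies the lemma.

Now using the above Lemma together with Theorem 3.2
(i.a) and a simple truncation argument we can prove
part (i.b).

Proof. (of Theorem 3.2 (i.b)) Without loss of generality
assume that $\|A\|_2 = \|B\|_2 = 1$. Set $\hat{r} =
1600 \max\{\mathrm{sr}(A), \mathrm{sr}(B)\}/\varepsilon^2$. Set $\hat{A} = A - A_{\hat{r}}$, $\hat{B} = B - B_{\hat{r}}$.
Since $\|A\|_F^2 = \sum_{j=1}^{\mathrm{rank}(A)} \sigma_j(A)^2$,

$$\|\hat{A}\|_2 \le \frac{\|A\|_F}{\sqrt{\hat{r}}} \le \frac{\varepsilon}{40} \quad \text{and} \quad \|\hat{B}\|_2 \le \frac{\|B\|_F}{\sqrt{\hat{r}}} \le \frac{\varepsilon}{40}.$$

By triangle inequality, it follows that

$$\|A^\top R^\top R B - A^\top B\|_2 \le \|A_{\hat{r}}^\top R^\top R B_{\hat{r}} - A_{\hat{r}}^\top B_{\hat{r}}\|_2 \qquad (4.7)$$
$$+ \|\hat{A}^\top R^\top R B_{\hat{r}}\|_2 + \|A_{\hat{r}}^\top R^\top R \hat{B}\|_2 + \|\hat{A}^\top R^\top R \hat{B}\|_2 \qquad (4.8)$$
$$+ \|\hat{A}^\top B_{\hat{r}}\|_2 + \|A_{\hat{r}}^\top \hat{B}\|_2 + \|\hat{A}^\top \hat{B}\|_2. \qquad (4.9)$$

Choose a constant in Theorem 3.2 (i.a) so that the
failure probability of the right hand side of (4.7) does
not exceed $\exp(-c\varepsilon^2 t)$, where $c = c_1/32$. The same
argument shows that $P(\|RA_{\hat{r}}\|_2 \ge 1+\varepsilon) \le \exp(-c\varepsilon^2 t)$
and $P(\|RB_{\hat{r}}\|_2 \ge 1+\varepsilon) \le \exp(-c\varepsilon^2 t)$. This combined
with Lemma 4.1 applied on $\hat{A}$ and $\hat{B}$ yields that the
sum in (4.8) is less than $2(1+\varepsilon)\varepsilon/10 + \varepsilon^2/100$. Also,
since $\|A_{\hat{r}}\|_2, \|B_{\hat{r}}\|_2 \le 1$, the sum in (4.9) is less than
$2\varepsilon/10 + \varepsilon^2/100$. Combining the bounds for (4.7), (4.8)
and (4.9) concludes the claim.

$^3$ We denote by colspan($A$) the subspace generated by the
columns of $A$, and rowspan($A$) the subspace generated by the
rows of $A$.

$^4$ For example, set $S = I_t$, $T = A$ in [HMT09, Proposition 10.1, p. 54].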



Row Sampling - Part (ii): By homogeneity 



normalize A and B such that 
Notice that A^ B = Y. 



\A\\2 = 



B 



= 1. 



where S = ^2^=1 



lAl)B(i). 



AJ., 



Define pi 



Lb, 



wl 



Also 



define a distribution over matrices in 
with n elements by 



P M 



1 

Pi 







^(T)^(o 



(m+p) X (m+p) 



p^- 



First notice that 



EM 



n ^ 

n 

E 



^(»)^(») 










B'^A 

A^B 



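This implies that $\|\mathbb{E}\, M\|_2 = \|A^\top B\|_2 \le 1$. Next notice
that the spectral norm of the random matrix $M$ is upper
bounded by $\sqrt{\mathrm{sr}(A)\,\mathrm{sr}(B)}$ almost surely. Indeed,

$\|M\|_2 \le \sup_{i \in [n]} \frac{\|A_{(i)}^\top B_{(i)}\|_2}{p_i} = S \sup_{i \in [n]} \frac{\|A_{(i)}^\top B_{(i)}\|_2}{\|A_{(i)}\|_2 \|B_{(i)}\|_2} = \sum_{i=1}^{n} \|A_{(i)}\|_2 \|B_{(i)}\|_2$
$\qquad \le \|A\|_F \|B\|_F = \sqrt{\mathrm{sr}(A)\,\mathrm{sr}(B)} \le (\mathrm{sr}(A) + \mathrm{sr}(B))/2,$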

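by definition of $p_i$, properties of norms, the Cauchy-Schwarz
inequality, and the arithmetic/geometric mean inequality.
Notice that this quantity (since the spectral
norms of both $A$, $B$ are one) is at most $r$ by assumption.
Also notice that every element on the support of
the random variable $M$ has rank at most two. It is
easy to see that, by setting $\gamma = r$, all the conditions in
Theorem 1.1 are satisfied, and hence we get indices
$i_1, i_2, \ldots, i_t$ from $[n]$, $t = \Omega(r \log(r/\varepsilon^2)/\varepsilon^2)$, such that with
high probability

$\left\| \frac{1}{t} \sum_{j=1}^{t} \frac{1}{p_{i_j}} \begin{bmatrix} 0 & B_{(i_j)}^\top A_{(i_j)} \\ A_{(i_j)}^\top B_{(i_j)} & 0 \end{bmatrix} - \begin{bmatrix} 0 & B^\top A \\ A^\top B & 0 \end{bmatrix} \right\|_2 \le \varepsilon.$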



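The off-diagonal blocks of the first sum can be rewritten as
$\widetilde{A}^\top \widetilde{B}$ and its transpose, where

$\widetilde{A} = \frac{1}{\sqrt{t}} \begin{bmatrix} \frac{1}{\sqrt{p_{i_1}}} A_{(i_1)}^\top & \frac{1}{\sqrt{p_{i_2}}} A_{(i_2)}^\top & \cdots & \frac{1}{\sqrt{p_{i_t}}} A_{(i_t)}^\top \end{bmatrix}^\top$

and

$\widetilde{B} = \frac{1}{\sqrt{t}} \begin{bmatrix} \frac{1}{\sqrt{p_{i_1}}} B_{(i_1)}^\top & \frac{1}{\sqrt{p_{i_2}}} B_{(i_2)}^\top & \cdots & \frac{1}{\sqrt{p_{i_t}}} B_{(i_t)}^\top \end{bmatrix}^\top.$

Since a symmetric block matrix with zero diagonal blocks and
off-diagonal block $X$ has spectral norm $\|X\|_2$, this concludes the theorem.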
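The row-sampling construction of Part (ii) can be sketched in a few lines. This is an illustrative sketch under assumptions, not the authors' implementation: the function name `row_sample_sketch`, the synthetic low-stable-rank inputs, and the sample size are hypothetical, and `q` is used for the number of columns of $B$ to avoid a clash with the probabilities $p_i$.

```python
import numpy as np

def row_sample_sketch(A, B, t, rng):
    """Sample t rows with p_i proportional to ||A_(i)||_2 * ||B_(i)||_2 and rescale."""
    weights = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    p = weights / weights.sum()
    idx = rng.choice(A.shape[0], size=t, replace=True, p=p)
    scale = 1.0 / np.sqrt(t * p[idx])          # 1 / sqrt(t * p_{i_j})
    return A[idx] * scale[:, None], B[idx] * scale[:, None]

rng = np.random.default_rng(1)
n, m, q = 5000, 30, 20
# Inputs with small stable rank: three dominant directions plus small noise.
A = rng.standard_normal((n, 3)) @ rng.standard_normal((3, m)) + 0.01 * rng.standard_normal((n, m))
B = rng.standard_normal((n, 3)) @ rng.standard_normal((3, q)) + 0.01 * rng.standard_normal((n, q))

A_s, B_s = row_sample_sketch(A, B, t=2000, rng=rng)

# Spectral-norm error relative to ||A||_2 ||B||_2, the quantity bounded by epsilon.
err = np.linalg.norm(A_s.T @ B_s - A.T @ B, 2)
rel_err = err / (np.linalg.norm(A, 2) * np.linalg.norm(B, 2))
print(f"relative spectral error: {rel_err:.4f}")
```

Only the sampled rows are touched after the probabilities are computed, which is the practical appeal of row sampling over taking random linear combinations of all rows.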



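4.2 Proof of Theorem 3.3 ($\ell_2$-regression)

Proof. (of Theorem 3.3) The proof is similar to the proof
in [Sar06]. Let $A = U \Sigma V^\top$ be the SVD of $A$. Let
$b = A x_{\mathrm{opt}} + w$, where $w \in \mathbb{R}^n$ and $w \perp \mathrm{colspan}(A)$. Also
let $A(\tilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}}) = Uy$, where $y \in \mathbb{R}^{\mathrm{rank}(A)}$. Our goal
is to bound this quantity

$\|b - A\tilde{x}_{\mathrm{opt}}\|_2^2 = \|b - A(\tilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}}) - A x_{\mathrm{opt}}\|_2^2$
$\qquad = \|w - Uy\|_2^2$
$\qquad = \|w\|_2^2 + \|Uy\|_2^2$, since $w \perp \mathrm{colspan}(U)$,
(4.10)    $\qquad = \|w\|_2^2 + \|y\|_2^2$, since $U^\top U = I$.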



It suffices to bound the norm of $y$, i.e., $\|y\|_2 \le 3\varepsilon \|w\|_2$.
Recall that given $A$, $b$ the vector $w$ is uniquely defined.
On the other hand, the vector $y$ depends on the random
projection $R$. Next we show the connection between $y$
and $w$ through the "normal equations".



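$R A \tilde{x}_{\mathrm{opt}} = Rb + w_2 \implies$
$R A \tilde{x}_{\mathrm{opt}} = R(A x_{\mathrm{opt}} + w) + w_2 \implies$
$R A (\tilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}}) = Rw + w_2 \implies$
$U^\top R^\top R U y = U^\top R^\top R w + U^\top R^\top w_2 \implies$
(4.11)    $U^\top R^\top R U y = U^\top R^\top R w,$

where $w_2 \perp \mathrm{colspan}(RU)$, so that $U^\top R^\top w_2 = 0$; we used this
fact to derive Eqn. (4.11). A crucial observation is that
colspan($U$) is perpendicular to $w$. Set $A = B = U$
in Theorem 3.2 and set $\varepsilon' = \sqrt{\varepsilon}$, and $t = \Omega(r/\varepsilon'^2)$. Notice
that $\mathrm{rank}(A) + \mathrm{rank}(B) \le 2r$, hence with constant
probability we know that $1 - \varepsilon' \le \sigma_i(RU) \le 1 + \varepsilon'$. It
follows that $\|U^\top R^\top R U y\|_2 \ge (1 - \varepsilon')^2 \|y\|_2$. A similar
argument (set $A = U$ and $B = w$ in Theorem 3.2) guarantees
that $\|U^\top R^\top R w\|_2 = \|U^\top R^\top R w - U^\top w\|_2 \le
\varepsilon' \|U\|_2 \|w\|_2 = \varepsilon' \|w\|_2$. Recall that $\|U\|_2 = 1$, since
$U^\top U = I$. Therefore, taking Euclidean
norms on both sides of Equation (4.11) we get
that

$\|y\|_2 \le \frac{\varepsilon'}{(1 - \varepsilon')^2} \|w\|_2 \le 4\varepsilon' \|w\|_2.$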

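Summing up, it follows from Equation (4.10) that,
with constant probability, $\|b - A\tilde{x}_{\mathrm{opt}}\|_2^2 \le (1 + 16\varepsilon'^2)\|b - A x_{\mathrm{opt}}\|_2^2 = (1 + 16\varepsilon)\|b - A x_{\mathrm{opt}}\|_2^2$. This
proves Ineq. (3.3).

Ineq. (3.4) follows directly from the bound on the
norm of $y$, repeating the above proof with $\varepsilon' \leftarrow \varepsilon$.
First recall that $x_{\mathrm{opt}}$ is in the row span of $A$, since
$x_{\mathrm{opt}} = V \Sigma^{-1} U^\top b$ and the columns of $V$ span the
row space of $A$. The same holds for $\tilde{x}_{\mathrm{opt}}$, since the row span
of $R \cdot A$ is contained in the row span of $A$. Indeed,

$\varepsilon \|w\|_2 \ge \|y\|_2 = \|Uy\|_2 = \|A(x_{\mathrm{opt}} - \tilde{x}_{\mathrm{opt}})\|_2 \ge \sigma_{\min}(A)\, \|x_{\mathrm{opt}} - \tilde{x}_{\mathrm{opt}}\|_2.$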
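The reduction analysed in this proof is the familiar "sketch and solve" scheme: solve the projected problem $\min_x \|R(Ax - b)\|_2$ in place of the original one. Below is a minimal numerical sketch under illustrative assumptions (synthetic data, a sketch size `t` chosen by hand, and the names `x_opt` and `x_tilde` standing for $x_{\mathrm{opt}}$ and $\tilde{x}_{\mathrm{opt}}$); it is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, t = 4000, 10, 400            # t is the number of sketched rows
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

# Exact least-squares solution x_opt of min ||Ax - b||_2.
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sketched problem min ||R(Ax - b)||_2 with a random sign matrix rescaled by 1/sqrt(t).
R = rng.choice([-1.0, 1.0], size=(t, n)) / np.sqrt(t)
x_tilde, *_ = np.linalg.lstsq(R @ A, R @ b, rcond=None)

res_opt = np.linalg.norm(b - A @ x_opt)        # ||w||_2 in the notation of the proof
res_tilde = np.linalg.norm(b - A @ x_tilde)
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]

residual_ratio = res_tilde / res_opt
solution_gap = np.linalg.norm(x_opt - x_tilde) * sigma_min / res_opt
print(f"residual ratio: {residual_ratio:.4f}, scaled solution gap: {solution_gap:.4f}")
```

The first printed quantity corresponds to the relative residual bounded in Ineq. (3.3), the second to the solution error bounded in Ineq. (3.4).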



4.3 Proof of Theorems 3.4, 3.5 (Spectral Low
Rank Matrix Approximation)

Proof. (of Lemma 3.1) By the assumption and using
Lemma 5.1 we get that

(4.12)    $(1 - \varepsilon)\,\sigma_i(A^\top A) \le \sigma_i(\widetilde{A}^\top \widetilde{A}) \le (1 + \varepsilon)\,\sigma_i(A^\top A)$

for all $i = 1, \ldots, \mathrm{rank}(A)$. Let $\Pi_k$ be the projection
matrix onto the first $k$ right singular vectors of $\widetilde{A}$, i.e.,
$\Pi_k = (\widetilde{A}_k)^\dagger \widetilde{A}_k$. It follows that for every $k = 1, \ldots, \mathrm{rank}(A)$

$\|A - P_{\widetilde{A},k}(A)\|_2^2 \le \|A - A\Pi_k\|_2^2$
$\qquad = \sup_{x \in \mathbb{R}^m,\ \|x\|_2 = 1} \|A(I - \Pi_k)x\|_2^2$
$\qquad = \sup_{x \in \ker \Pi_k,\ \|x\|_2 = 1} \|Ax\|_2^2$
$\qquad = \sup_{x \in \ker \Pi_k,\ \|x\|_2 = 1} x^\top A^\top A x$
$\qquad \le (1 + \varepsilon) \sup_{x \in \ker \Pi_k,\ \|x\|_2 = 1} x^\top \widetilde{A}^\top \widetilde{A} x$
$\qquad = (1 + \varepsilon)\,\sigma_{k+1}(\widetilde{A}^\top \widetilde{A})$
$\qquad \le (1 + \varepsilon)^2\,\sigma_{k+1}(A^\top A)$
$\qquad = (1 + \varepsilon)^2\,\|A - A_k\|_2^2,$

using that $x \perp \ker \Pi_k$ implies $\Pi_k x = x$, the left side of the
hypothesis, Courant-Fischer on $\widetilde{A}^\top \widetilde{A}$ (see Eqn. (5.17)),
Eqn. (4.12), and properties of singular values, respectively.

Proof of Theorem 3.4 (i):

Part (a): Now we are ready to prove our first
corollary of our matrix multiplication result to the
problem of computing an approximate low rank matrix
approximation of a matrix with respect to the spectral
norm (Theorem 3.4).

Proof. Set $\widetilde{A} = \frac{1}{\sqrt{t}} R A$ where $R$ is a $t \times n$ random
sign matrix with $t = \Omega(r/\varepsilon^2)$. Applying Theorem 3.2 (i.a) on $A$ we have
with high probability that

(4.13)    $\forall x \in \mathbb{R}^m,\quad (1 - \varepsilon)\, x^\top A^\top A x \le x^\top \widetilde{A}^\top \widetilde{A} x \le (1 + \varepsilon)\, x^\top A^\top A x.$

Combining Lemma 3.1 with Ineq. (4.13) concludes the
proof.

Part (b): The proof is based on the following
lemma which reduces the problem of low rank matrix
approximation to the problem of bounding the norm
of a random matrix. We restate it here for the reader's
convenience and completeness [NDT09, Lemma 8] (see
also [HMT09, Theorem 9.1] or [BMD09]).

Lemma 4.2. Let $A = A_k + U_{r-k} \Sigma_{r-k} V_{r-k}^\top$, $H_k = U_{r-k} \Sigma_{r-k}$
and $R$ be any $t \times n$ matrix. If the matrix
$(RU_k)$ has full column rank, then the following inequality
holds,

(4.14)    $\|A - P_{(RA),k}(A)\|_2 \le 2\|A - A_k\|_2 + \|(RU_k)^\dagger R H_k\|_2.$

Notice that the above lemma reduces the problem of
spectral low rank matrix approximation to a problem of
approximating the spectral norm of the random matrix
$(RU_k)^\dagger R H_k$.

First notice that by setting $t = \Omega(k/\varepsilon^2)$ we can
guarantee that the matrix $(RU_k)$ will have full column
rank with high probability. Actually, we can say
something much stronger; applying Theorem 3.2 (i.a)
with $A = U_k$ we can guarantee that all the singular
values are within $1 \pm \varepsilon$ with high probability. Now by
conditioning on the above event ($(RU_k)$ has full column
rank), it follows from Lemma 4.2 that

$\|A - P_{(RA),k}(A)\|_2 \le 2\|A - A_k\|_2 + \|(RU_k)^\dagger R H_k\|_2$
$\qquad \le 2\|A - A_k\|_2 + \|(RU_k)^\dagger\|_2 \|R H_k\|_2$
$\qquad \le 2\|A - A_k\|_2 + \frac{1}{1 - \varepsilon} \|R H_k\|_2$
$\qquad \le 2\|A - A_k\|_2 + \frac{3}{2} \|R U_{r-k}\|_2 \|\Sigma_{r-k}\|_2$

using the sub-multiplicative property of matrix norms,
and that $\varepsilon < 1/3$. Now, it suffices to bound the norm
of $W := R U_{r-k}$. Recall that $R = \frac{1}{\sqrt{t}} G$ where $G$
is a $t \times n$ random Gaussian matrix. It is well known
that, by rotational invariance of the Gaussian distribution,
the entries of the random matrix $G U_{r-k}$ are
also i.i.d. Gaussian random variables.
Now, we can use the following fact about random sub-Gaussian
matrices to give a bound on the spectral norm
of $W$. Indeed, we have the following



Theorem 4.2. [RV09, Proposition 2.3] Let $W$ be a
$t \times (r - k)$ random matrix whose entries are independent
mean zero Gaussian random variables. Assume that
$r - k \ge t$, then

(4.15)    $\mathbb{P}\left( \|W\|_2 \ge \delta \sqrt{r - k} \right) \le e^{-c_0 \delta^2 (r - k)}$

for any $\delta > \delta_0$, where $\delta_0$ is a positive constant.

Apply the union bound on the above theorem, with $\delta$ being
a sufficiently large constant, and on the conditions of
Theorem 3.2; we get that with high probability $\|W\|_2 \le C_3 \sqrt{r - k}$
and $\|(RU_k)^\dagger\|_2 = 1/\sigma_{\min}(RU_k) \le 1/(1 - \varepsilon)$. Hence,
Lemma 4.2 combined with the above discussion implies
that

$\|A - P_{(RA),k}(A)\|_2 \le 2\|A - A_k\|_2 + \frac{3}{2} \|R U_{r-k}\|_2 \|A - A_k\|_2$
$\qquad = 2\|A - A_k\|_2 + \frac{3}{2\sqrt{t}} \|G U_{r-k}\|_2 \|A - A_k\|_2$
$\qquad \le \left( 2 + c_4\, \varepsilon \sqrt{\frac{r - k}{k}} \right) \|A - A_k\|_2,$

where $c_4 > 0$ is an absolute constant. Rescaling $\varepsilon$ by $c_4$
concludes Theorem 3.4 (i.b).
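For intuition, the random-projection scheme of part (i.b) can be tried on a small synthetic matrix. The sketch below is illustrative only and rests on stated assumptions: it takes $P_{(RA),k}(A)$ to mean the best rank-$k$ approximation of $A$ after projecting its rows onto rowspan$(RA)$, and the helper names `best_rank_k` and `projected_rank_k`, the test spectrum $\sigma_j = 1/j$, and the sizes are all hypothetical choices.

```python
import numpy as np

def best_rank_k(X, k):
    """Best rank-k approximation of X via a truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def projected_rank_k(A, A_sketch, k):
    """Rank-k truncation of A after projecting its rows onto rowspan(A_sketch)."""
    Q, _ = np.linalg.qr(A_sketch.T)      # columns: orthonormal basis of rowspan(A_sketch)
    return best_rank_k(A @ Q, k) @ Q.T

rng = np.random.default_rng(3)
n, m, k, t = 300, 200, 5, 60

# Test matrix with prescribed singular values 1, 1/2, 1/3, ...
U, _ = np.linalg.qr(rng.standard_normal((n, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
sigma = 1.0 / np.arange(1, m + 1)
A = (U * sigma) @ V.T

G = rng.standard_normal((t, n)) / np.sqrt(t)   # Gaussian projection, as in part (i.b)
approx = projected_rank_k(A, G @ A, k)

err = np.linalg.norm(A - approx, 2)
best = sigma[k]                                # ||A - A_k||_2 = sigma_{k+1}
print(f"||A - P(A)||_2 / ||A - A_k||_2 = {err / best:.3f}")
```

By the Eckart-Young theorem the printed ratio is never below one; the theorem bounds how far above one it can be.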

Proof of Theorem 3.4 (ii): Here we prove that
we can achieve the same relative error bound as with
random projections by just sampling rows of $A$ through
a judiciously selected distribution. However, there is a
price to pay, namely an extra logarithmic factor on
the number of samples, as is stated in Theorem 3.4 part
(ii).

Proof. (of Theorem 3.4 (ii)) The proof follows closely
the proof of [SS08] and is similar to the proof of part (a).
Let $A = U \Sigma V^\top$ be the singular value decomposition
of $A$. Define the projector matrix $\Pi = U U^\top$ of size
$n \times n$. Clearly, the rank of $\Pi$ is equal to the rank of $A$
and $\Pi$ has the same image as $A$, since every element
in the image of $A$ and $\Pi$ is a linear combination of
columns of $U$. Recall that for any projection matrix
$\Pi^2 = \Pi$ holds, and hence $\mathrm{sr}(\Pi) = \mathrm{rank}(A) = r$.
Moreover, $\sum_{i=1}^{n} \|U_{(i)}\|_2^2 = \mathrm{tr}(U U^\top) = \mathrm{tr}(\Pi) = \mathrm{tr}(\Pi^2) = r$.
Let $p_i = \Pi(i,i)/r = \|U_{(i)}\|_2^2 / r$ be a
probability distribution on $[n]$, where $U_{(i)}$ is the $i$-th row
of $U$.

Define a $t \times n$ random matrix $S$ as follows: Pick $t$
samples from $p_i$; if the $i$-th sample is equal to $j$ ($\in [n]$)
then set $S_{ij} = 1/\sqrt{t\, p_j}$. Notice that $S$ has exactly one
non-zero entry in each row, hence it has $t$ non-zero
entries. Define $\widetilde{A} = SA$.
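The sampling matrix $S$ never needs to be formed explicitly, since $SA$ just selects and rescales rows of $A$. The following is a minimal sketch under assumptions (synthetic input with uneven row norms, a hand-picked sample size, and the normalisation $1/\sqrt{t\,p_j}$ for the non-zero entries of $S$); it also evaluates the deviation $\|\Pi S^\top S \Pi - \Pi\|_2$ that the proof goes on to bound, using $\Pi = UU^\top$ to reduce it to an $r \times r$ computation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, t = 3000, 8, 1500
# Rows with very different norms, so uniform sampling would be a poor choice.
A = rng.standard_normal((n, m)) * rng.uniform(0.1, 10.0, size=(n, 1))

U, _, _ = np.linalg.svd(A, full_matrices=False)    # n x r with orthonormal columns
r = U.shape[1]
p = np.sum(U ** 2, axis=1) / r                      # p_i = ||U_(i)||_2^2 / r

idx = rng.choice(n, size=t, p=p)                    # t i.i.d. samples from p
scale = 1.0 / np.sqrt(t * p[idx])
SU = U[idx] * scale[:, None]                        # S U, without forming S
SA = A[idx] * scale[:, None]                        # the sketch S A

# Since Pi = U U^T, ||Pi S^T S Pi - Pi||_2 = ||(SU)^T (SU) - I_r||_2.
dev = np.linalg.norm(SU.T @ SU - np.eye(r), 2)

# Ratios of squared singular values of the sketch and of A.
ratios = np.linalg.svd(SA, compute_uv=False) ** 2 / np.linalg.svd(A, compute_uv=False) ** 2
print(f"deviation = {dev:.3f}, ratio range = [{ratios.min():.3f}, {ratios.max():.3f}]")
```

The printed ratios lie within $1 \pm$ the printed deviation, which is the content of Lemma 5.1 applied to $A^\top A$ and $\widetilde{A}^\top \widetilde{A}$.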

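It is easy to verify that $\mathbb{E}_S\, \Pi S^\top S \Pi = \Pi^2 = \Pi$.
Apply Theorem 1.1 (alternatively we can use [RV07,
Theorem 3.1], since the matrix samples are rank one)
on the matrix $\Pi$; notice that $\|\Pi\|_F^2 = r$ and $\|\Pi\|_2 = 1$,
$\|\mathbb{E}_S\, \Pi S^\top S \Pi\|_2 \le 1$, hence the stable rank of $\Pi$ is
$r$. Therefore, if $t = \Omega(r \log(r/\varepsilon^2)/\varepsilon^2)$ then with high
probability

(4.16)    $\|\Pi S^\top S \Pi - \Pi \Pi\|_2 \le \varepsilon.$

It suffices to show that Ineq. (4.16) is equivalent with
the condition of Lemma 3.1. Indeed,

$\sup_{x \in \mathbb{R}^n,\ x \ne 0} \frac{|x^\top (\Pi S^\top S \Pi - \Pi \Pi) x|}{x^\top x} \le \varepsilon \iff$
$\sup_{x \notin \ker \Pi} \frac{|x^\top (\Pi S^\top S \Pi - \Pi \Pi) x|}{x^\top x} \le \varepsilon \iff$
$\sup_{y \in \mathrm{Im}(A),\ y \ne 0} \frac{|y^\top (\Pi S^\top S \Pi - \Pi \Pi) y|}{y^\top y} \le \varepsilon \iff$
$\sup_{x \in \mathbb{R}^m,\ Ax \ne 0} \frac{|x^\top A^\top (\Pi S^\top S \Pi - \Pi \Pi) A x|}{x^\top A^\top A x} \le \varepsilon \iff$
$\sup_{x \in \mathbb{R}^m,\ Ax \ne 0} \frac{|x^\top (A^\top S^\top S A - A^\top A) x|}{x^\top A^\top A x} \le \varepsilon \iff$
$\sup_{x \in \mathbb{R}^m,\ Ax \ne 0} \frac{|x^\top (\widetilde{A}^\top \widetilde{A} - A^\top A) x|}{x^\top A^\top A x} \le \varepsilon,$

since $x \notin \ker \Pi$ implies $x \in \mathrm{Im}(A)$, $\mathrm{Im}(A) = \mathrm{Im}(\Pi)$,
and $\Pi A = A$. By re-arranging terms we get Equation (4.13)
and so the claim follows.

Proof of Theorem 3.5: The proof is similar to the proof
of Theorem 3.4 (i.b). By following the proof of part
(i.b), conditioning on the event that $(RU_k)$ has full
column rank in Lemma 4.2, we get with high probability
that

$\|A - P_{\widetilde{A},k}(A)\|_2 \le 2\|A - A_k\|_2 + \frac{\|U_k^\top R^\top R H_k\|_2}{(1 - \varepsilon)^2},$

using the fact that if $(RU_k)$ has full column
rank then $(RU_k)^\dagger = ((RU_k)^\top R U_k)^{-1} U_k^\top R^\top$ and
$\|((RU_k)^\top R U_k)^{-1}\|_2 \le 1/(1 - \varepsilon)^2$. Now observe
that $U_k^\top H_k = 0$. Since $\mathrm{sr}(H_k) \le k$, using Theorem 3.2
(i.b) with $t = \Omega(k/\varepsilon^4)$, we get
that $\|U_k^\top R^\top R H_k\|_2 = \|U_k^\top R^\top R H_k - U_k^\top H_k\|_2 \le
\varepsilon \|U_k\|_2 \|H_k\|_2 = \varepsilon \|A - A_k\|_2$ with high probability.
Rescaling $\varepsilon$ concludes the proof.

Appendix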

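The next lemma states that if a symmetric positive
semi-definite matrix $\widetilde{A}$ approximates the Rayleigh quotient
of a symmetric positive semi-definite matrix $A$,
then the eigenvalues of $\widetilde{A}$ also approximate the
eigenvalues of $A$.

Lemma 5.1. Let $0 < \varepsilon < 1$. Assume $A$, $\widetilde{A}$ are $n \times n$
symmetric positive semi-definite matrices, such that the
following inequality holds

$(1 - \varepsilon)\, x^\top A x \le x^\top \widetilde{A} x \le (1 + \varepsilon)\, x^\top A x, \quad \forall x \in \mathbb{R}^n.$

Then, for $i = 1, \ldots, n$ the eigenvalues of $A$ and $\widetilde{A}$ are
the same up to an error factor $\varepsilon$, i.e.,

$(1 - \varepsilon)\, \lambda_i(A) \le \lambda_i(\widetilde{A}) \le (1 + \varepsilon)\, \lambda_i(A).$

Proof. The proof is an immediate consequence of the
Courant-Fischer characterization of the eigenvalues.
First notice that by hypothesis, $A$ and $\widetilde{A}$ have the same
null space. Hence we can assume without loss of generality
that $\lambda_i(A), \lambda_i(\widetilde{A}) > 0$ for all $i = 1, \ldots, n$. Let
$\lambda_i(A)$ and $\lambda_i(\widetilde{A})$ be the eigenvalues (in non-decreasing
order) of $A$ and $\widetilde{A}$, respectively. The Courant-Fischer
min-max theorem [GV96, p. 394] expresses the eigenvalues as

(5.17)    $\lambda_i(A) = \min_{S^i} \max_{x \in S^i} \frac{x^\top A x}{x^\top x},$

where the minimum is over all $i$-dimensional subspaces
$S^i$. Let $S_0^i$ and $S_1^i$ be the subspaces where the minimum
is achieved for the eigenvalues of $A$ and $\widetilde{A}$, respectively.
Then, it follows that

$\lambda_i(\widetilde{A}) = \min_{S^i} \max_{x \in S^i} \frac{x^\top \widetilde{A} x}{x^\top x} \le \max_{x \in S_0^i} \frac{x^\top \widetilde{A} x}{x^\top A x} \cdot \frac{x^\top A x}{x^\top x} \le (1 + \varepsilon)\, \lambda_i(A),$

and similarly,

$\lambda_i(A) = \min_{S^i} \max_{x \in S^i} \frac{x^\top A x}{x^\top x} \le \max_{x \in S_1^i} \frac{x^\top A x}{x^\top \widetilde{A} x} \cdot \frac{x^\top \widetilde{A} x}{x^\top x} \le \frac{\lambda_i(\widetilde{A})}{1 - \varepsilon}.$

Therefore, it follows that for $i = 1, \ldots, n$,

$(1 - \varepsilon)\, \lambda_i(A) \le \lambda_i(\widetilde{A}) \le (1 + \varepsilon)\, \lambda_i(A).$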
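Lemma 5.1 is easy to check numerically. In the sketch below, which is only an illustration, the perturbed matrix is built as $\widetilde{A} = A^{1/2}(I + E)A^{1/2}$ with $\|E\|_2 = \varepsilon$; this particular construction, the dimension, and the variable names are assumptions made so that the hypothesis of the lemma holds by design.

```python
import numpy as np

rng = np.random.default_rng(5)
n, eps = 12, 0.2

# A: random symmetric positive definite matrix; A_half: its symmetric square root.
X = rng.standard_normal((n, n))
A = X @ X.T + 0.1 * np.eye(n)
w, V = np.linalg.eigh(A)
A_half = (V * np.sqrt(w)) @ V.T

# Symmetric perturbation E with ||E||_2 = eps, so that
# (1 - eps) x^T A x <= x^T A_tilde x <= (1 + eps) x^T A x for every x.
E = rng.standard_normal((n, n))
E = (E + E.T) / 2
E *= eps / np.linalg.norm(E, 2)
A_tilde = A_half @ (np.eye(n) + E) @ A_half

lam = np.linalg.eigvalsh(A)              # eigenvalues in non-decreasing order
lam_tilde = np.linalg.eigvalsh(A_tilde)
ratios = lam_tilde / lam
print(ratios.min(), ratios.max())        # both lie in [1 - eps, 1 + eps]
```

The comparison is made eigenvalue by eigenvalue in sorted order, which matches the indexing used in the min-max characterization (5.17).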
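Proof of Theorem 1.1. For notational convenience,
let $Z = \left\| \frac{1}{t} \sum_{i=1}^{t} M_i - \mathbb{E} M \right\|_2$ and define
$E_p := \mathbb{E}_{M_1, M_2, \ldots, M_t} Z^p$. Moreover, let $X_1, X_2, \ldots, X_n$ be
copies of a (matrix-valued) random variable $X$; we will
denote $\mathbb{E}_{X_1, X_2, \ldots, X_n}$ by $\mathbb{E}_{X_{[n]}}$. Our goal is to give sharp
bounds on the moments of the non-negative random
variable $Z$ and then use the moment method to give a
concentration result for $Z$.

First we give a technical lemma of independent
interest that bounds the $p$-th moments of $Z$ as a function
of $p$, $r$ (the rank of the samples), and the $p/2$-th moment
of the random variable $\left\| \sum_{j=1}^{t} M_j^2 \right\|_2$. More formally, we
have the following

Lemma 5.2. Let $M_1, \ldots, M_t$ be i.i.d. copies of $M$,
where $M$ is a symmetric matrix-valued random variable
that has rank at most $r$ almost surely. Then for every
$p \ge 2$

(5.18)    $E_p \le r\, t^{1-p} (2B_p)^p\, \mathbb{E}_{M_{[t]}} \left\| \sum_{j=1}^{t} M_j^2 \right\|_2^{p/2},$

where $B_p$ is a constant that depends on $p$.

We need a non-commutative version of the Khintchine
inequality due to F. Lust-Piquard [LP86], see also [LPP91]
and [Buc01, Theorem 5]. We start with some preliminaries;
let $A \in \mathbb{R}^{n \times n}$ and denote by $C_p^n$ the $p$-th Schatten
norm space (the Banach space of linear operators,
or matrices in our setting, in $\mathbb{R}^n$) equipped with the
norm

(5.19)    $\|A\|_{C_p^n} := \left( \sum_{i=1}^{n} \sigma_i(A)^p \right)^{1/p},$

where $\sigma_i(A)$ are the singular values of $A$, see [Bha96,
Chapter IV, p. 92] for a discussion on Schatten norms.
Notice that $\|A\|_2 = \sigma_1(A)$, hence we have the following
inequality

(5.20)    $\|A\|_2 \le \|A\|_{C_p^n} \le (\mathrm{rank}(A))^{1/p}\, \|A\|_2,$

for any $p \ge 1$. Notice that when $p = \log_2(\mathrm{rank}(A))$,
then $\mathrm{rank}(A)^{1/\log_2(\mathrm{rank}(A))} = 2$. Therefore, in this
case, the Schatten norm is essentially the spectral norm.
We are now ready to state the matrix-valued Khintchine
inequality. See e.g. [Rud99] or [NDT09, Lemma 8].

Theorem 5.1. Assume $2 \le p < \infty$. Then there exists
a constant $B_p$ such that for any sequence of $t$ symmetric
matrices $M_1, \ldots, M_t$, with $M_i \in C_p^n$, the
following inequality holds

(5.21)    $\left( \mathbb{E}_{\varepsilon} \left\| \sum_{i=1}^{t} \varepsilon_i M_i \right\|_{C_p^n}^p \right)^{1/p} \le B_p \left\| \left( \sum_{i=1}^{t} M_i^2 \right)^{1/2} \right\|_{C_p^n},$

where for every $i \in [t]$, $\varepsilon_i$ is a Bernoulli random
variable. Moreover, $B_p$ is at most $2^{-1/4} \sqrt{\pi/e}\, \sqrt{p}$.^5

^5 See Eqn. (17) in [NDT09] or [Buc01].

Now we are ready to prove Lemma 5.2.

Proof. (of Lemma 5.2) The proof is inspired from [RV07,
Theorem 3.1]. Let $p \ge 2$. First, apply a standard
symmetrization argument (see [LT91]), which gives that

$\mathbb{E}_{M_{[t]}} \left\| \frac{1}{t} \sum_{i=1}^{t} M_i - \mathbb{E} M \right\|_2^p \le 2^p\, \mathbb{E}_{M_{[t]}} \mathbb{E}_{\varepsilon_{[t]}} \left\| \frac{1}{t} \sum_{i=1}^{t} \varepsilon_i M_i \right\|_2^p.$

Indeed, let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_t$ denote independent Bernoulli
variables. Let $M_1, \ldots, M_t, \widetilde{M}_1, \ldots, \widetilde{M}_t$ be independent
copies of $M$. We essentially estimate the $p$-th root of $E_p$,

(5.22)    $E_p^{1/p} = \left( \mathbb{E}_{M_{[t]}} \left\| \frac{1}{t} \sum_{i=1}^{t} M_i - \mathbb{E} M \right\|_2^p \right)^{1/p}.$

Notice that $\mathbb{E} M = \mathbb{E}_{\widetilde{M}_{[t]}} \left( \frac{1}{t} \sum_{i=1}^{t} \widetilde{M}_i \right)$. We plug this
into (5.22) and apply Jensen's inequality,

$E_p^{1/p} = \left( \mathbb{E}_{M_{[t]}} \left\| \frac{1}{t} \sum_{i=1}^{t} M_i - \mathbb{E}_{\widetilde{M}_{[t]}} \frac{1}{t} \sum_{i=1}^{t} \widetilde{M}_i \right\|_2^p \right)^{1/p} \le \left( \mathbb{E}_{M_{[t]}} \mathbb{E}_{\widetilde{M}_{[t]}} \left\| \frac{1}{t} \sum_{i=1}^{t} M_i - \frac{1}{t} \sum_{i=1}^{t} \widetilde{M}_i \right\|_2^p \right)^{1/p}.$

Now, notice that $M_i - \widetilde{M}_i$ is a symmetric (in distribution)
matrix-valued random variable for every $i \in [t]$, i.e., it is distributed
identically with $\varepsilon_i (M_i - \widetilde{M}_i)$. Thus

$E_p^{1/p} \le \left( \mathbb{E}_{M_{[t]}} \mathbb{E}_{\widetilde{M}_{[t]}} \mathbb{E}_{\varepsilon_{[t]}} \left\| \frac{1}{t} \sum_{i=1}^{t} \varepsilon_i (M_i - \widetilde{M}_i) \right\|_2^p \right)^{1/p}.$

Denote $Y = \frac{1}{t} \sum_{i=1}^{t} \varepsilon_i M_i$ and $\widetilde{Y} = \frac{1}{t} \sum_{i=1}^{t} \varepsilon_i \widetilde{M}_i$. Then
$\|Y - \widetilde{Y}\|_2^p \le (\|Y\|_2 + \|\widetilde{Y}\|_2)^p \le 2^{p-1} (\|Y\|_2^p + \|\widetilde{Y}\|_2^p)$, and
$\mathbb{E} \|Y\|_2^p = \mathbb{E} \|\widetilde{Y}\|_2^p$. Thus, we obtain that

(5.23)    $E_p^{1/p} \le 2 \left( \mathbb{E}_{M_{[t]}} \mathbb{E}_{\varepsilon_{[t]}} \left\| \frac{1}{t} \sum_{i=1}^{t} \varepsilon_i M_i \right\|_2^p \right)^{1/p}.$

Now by Khintchine's inequality the following holds
for any fixed symmetric matrices $M_1, M_2, \ldots, M_t$

(5.24)    $\left( \mathbb{E}_{\varepsilon_{[t]}} \left\| \frac{1}{t} \sum_{j=1}^{t} \varepsilon_j M_j \right\|_2^p \right)^{1/p} \le \frac{1}{t} \left( \mathbb{E}_{\varepsilon_{[t]}} \left\| \sum_{j=1}^{t} \varepsilon_j M_j \right\|_{C_p^n}^p \right)^{1/p}$
$\qquad \le \frac{B_p}{t} \left\| \left( \sum_{j=1}^{t} M_j^2 \right)^{1/2} \right\|_{C_p^n}$
$\qquad \le \frac{(rt)^{1/p} B_p}{t} \left\| \left( \sum_{j=1}^{t} M_j^2 \right)^{1/2} \right\|_2$
$\qquad = \frac{(rt)^{1/p} B_p}{t} \left\| \sum_{j=1}^{t} M_j^2 \right\|_2^{1/2},$

taking $1/t$ outside the expectation and using the left
part of Ineq. (5.20), Ineq. (5.21), the right part of
Ineq. (5.20) and the fact that the matrix $\left( \sum_{j=1}^{t} M_j^2 \right)^{1/2}$
has rank at most $rt$.

Now raising Ineq. (5.24) to the $p$-th power on
both sides and then taking expectation with respect to
$M_1, \ldots, M_t$, it follows from Ineq. (5.23) that

$E_p \le 2^p\, \frac{rt}{t^p}\, B_p^p\, \mathbb{E}_{M_{[t]}} \left\| \sum_{j=1}^{t} M_j^2 \right\|_2^{p/2} = r\, t^{1-p} (2B_p)^p\, \mathbb{E}_{M_{[t]}} \left\| \sum_{j=1}^{t} M_j^2 \right\|_2^{p/2}.$

This concludes the proof of Lemma 5.2.

Now we are ready to prove Theorem 1.1. First we can
assume without loss of generality that $M \succeq 0$ almost
surely, losing only a constant factor in our bounds.
Indeed, by the spectral decomposition theorem any
symmetric matrix can be written as $M = \sum_j \lambda_j u_j u_j^\top$.
Set $M_+ = \sum_{\lambda_j \ge 0} \lambda_j u_j u_j^\top$ and $M_- = M - M_+$. It is
clear that $\|M_+\|_2, \|M_-\|_2 \le \|M\|_2$, $\|M_+\|_F, \|M_-\|_F \le
\|M\|_F$ and $\mathrm{rank}(M_+), \mathrm{rank}(M_-) \le \mathrm{rank}(M)$. The triangle
inequality tells us that

$\left\| \frac{1}{t} \sum_{j=1}^{t} M_j - \mathbb{E} M \right\|_2 \le \left\| \frac{1}{t} \sum_{j=1}^{t} (M_j)_+ - \mathbb{E} M_+ \right\|_2 + \left\| \frac{1}{t} \sum_{j=1}^{t} (M_j)_- - \mathbb{E} M_- \right\|_2$

and one can bound each term of the right hand side
separately. Hence, from now on we assume that $M \succeq 0$
a.s. Now use the fact that for every $j \in [t]$, $M_j^2 \preceq \gamma M_j$,
since the $M_j$'s are positive semi-definite and $\|M_j\|_2 \le \gamma$
almost surely. Summing up all the inequalities we get
that

(5.25)    $\left\| \sum_{j=1}^{t} M_j^2 \right\|_2 \le \gamma \left\| \sum_{j=1}^{t} M_j \right\|_2.$

It follows that

$E_p \le r\, t^{1-p} (2B_p)^p\, \mathbb{E}_{M_{[t]}} \left\| \sum_{j=1}^{t} M_j^2 \right\|_2^{p/2}$
$\qquad \le r\, t^{1-p} (2B_p)^p\, \gamma^{p/2}\, \mathbb{E}_{M_{[t]}} \left\| \sum_{j=1}^{t} M_j \right\|_2^{p/2}$
$\qquad = \frac{rt\, (2B_p \sqrt{\gamma})^p}{t^{p/2}}\, \mathbb{E}_{M_{[t]}} \left\| \frac{1}{t} \sum_{j=1}^{t} M_j \right\|_2^{p/2}$
$\qquad = \frac{rt\, (2B_p \sqrt{\gamma})^p}{t^{p/2}}\, \mathbb{E}_{M_{[t]}} \left\| \frac{1}{t} \sum_{j=1}^{t} M_j - \mathbb{E} M + \mathbb{E} M \right\|_2^{p/2}$
$\qquad \le \frac{rt\, (2B_p \sqrt{\gamma})^p}{t^{p/2}} \left( \left( \mathbb{E}_{M_{[t]}} \left\| \frac{1}{t} \sum_{j=1}^{t} M_j - \mathbb{E} M \right\|_2^{p/2} \right)^{2/p} + 1 \right)^{p/2}$
$\qquad \le \frac{rt\, (2B_p \sqrt{\gamma})^p}{t^{p/2}} \left( E_p^{1/p} + 1 \right)^{p/2},$

using Lemma 5.2, Ineq. (5.25), Minkowski's inequality,
Jensen's inequality, the definition of $E_p$ and the assumption
$\|\mathbb{E} M\|_2 \le 1$. This implies the following inequality

(5.26)    $E_p^{1/p} \le \frac{2B_p \sqrt{\gamma}\, (rt)^{1/p}}{\sqrt{t}} \sqrt{E_p^{1/p} + 1} \le \frac{2B_p \sqrt{\gamma}\, (rt)^{1/p}}{\sqrt{t}} \left( E_p^{1/p} + 1 \right),$

using that $\sqrt{1 + x} \le 1 + x$, $x \ge 0$. Let $a_p = \frac{4B_p \sqrt{\gamma}\, (rt)^{1/p}}{\sqrt{t}}$.
Then it follows from the above inequality that $E_p^{1/p} \le
\frac{a_p}{2} (E_p^{1/p} + 1)$. It follows that^6 $\min\{E_p^{1/p}, 1\} \le a_p$. Also
notice that

(5.27)    $\left( \mathbb{E} \min\{Z, 1\}^p \right)^{1/p} \le \min(E_p^{1/p}, 1).$

^6 Indeed, if $E_p^{1/p} < 1$, then $E_p^{1/p} \le a_p$. Otherwise $1 \le a_p$.

Now for any $0 < \varepsilon < 1$, $\mathbb{P}(Z > \varepsilon) = \mathbb{P}(\min\{Z, 1\} > \varepsilon)$.
By the moment method we have that

$\mathbb{P}(\min\{Z, 1\} > \varepsilon) = \mathbb{P}(\min\{Z, 1\}^p > \varepsilon^p)$
$\qquad \le \inf_{p \ge 2} \frac{\mathbb{E} \min\{Z, 1\}^p}{\varepsilon^p}$
$\qquad \le \inf_{p \ge 2} \left( \frac{\min(E_p^{1/p}, 1)}{\varepsilon} \right)^p$
$\qquad \le \inf_{p \ge 2} \left( \frac{C_2 \sqrt{\gamma p}\, (rt)^{1/p}}{\varepsilon \sqrt{t}} \right)^p,$

where $C_2 > 0$ is an absolute constant.

Now assume that $r \le t$ and then set $p = c_2 \log t$,
where $c_2 > 0$ is a sufficiently large constant, in the
infimum expression in the above inequality; it follows
that

$\mathbb{P}\left( \left\| \frac{1}{t} \sum_{i=1}^{t} M_i - \mathbb{E} M \right\|_2 > \varepsilon \right) \le \left( \frac{C_2 \sqrt{\gamma c_2 \log t}\, (rt)^{1/(c_2 \log t)}}{\varepsilon \sqrt{t}} \right)^{c_2 \log t}.$

We want to make the base of the above exponent smaller
than one. It is easy to see that this is possible if we set
$t = C_0 \gamma/\varepsilon^2 \log(C_0 \gamma/\varepsilon^2)$ where $C_0$ is a sufficiently large
absolute constant. Hence the above
probability is at most $1/\mathrm{poly}(t)$. This concludes the
proof.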
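The concentration phenomenon just proved can be observed on a toy distribution. The sketch below is an illustration under assumptions, not an experiment from the paper: each sample is the rank-one matrix $M = n\, q_i q_i^\top$ with $i$ uniform in $[n]$ and $q_1, \ldots, q_n$ an orthonormal basis, so that $\mathbb{E} M = I$, $\|\mathbb{E} M\|_2 = 1$ and $\|M\|_2 = n$ plays the role of $\gamma$; the function name `empirical_deviation` is hypothetical.

```python
import numpy as np

def empirical_deviation(t, Q, rng):
    """|| (1/t) sum_j M_j - E M ||_2 for M = n * q_i q_i^T with i uniform in [n]."""
    n = Q.shape[0]
    counts = np.bincount(rng.integers(0, n, size=t), minlength=n)
    # (1/t) * sum of the sampled matrices equals Q diag(n * counts / t) Q^T.
    mean = (Q * (n * counts / t)) @ Q.T
    return np.linalg.norm(mean - np.eye(n), 2)

rng = np.random.default_rng(7)
n = 10
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal columns q_1, ..., q_n

# Each sample has rank one and spectral norm n; the deviation shrinks as t grows.
for t in (200, 2000, 20000):
    print(t, empirical_deviation(t, Q, rng))
```

In this toy case the deviation decays roughly like $\sqrt{\gamma/t}$ up to logarithmic factors, in line with the sample size $t = C_0 \gamma/\varepsilon^2 \log(C_0 \gamma/\varepsilon^2)$ used in the proof.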



